The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a very historic moment, both scientifically and analytically: we have a large amount of data being shared across the globe and analyzed in real time. The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by its source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data, but it was always 'ahead' of Our World in Data. I also noticed that JHU's website was buggy: clicking on the U.S. stats did not actually reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World in Data has more extensive plots I can compare my own to). An interesting aside: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Fri May 22 11:29:03 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the analysis itself but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Fri May 22 11:29:03 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name      description
prediction_model   outputs case estimates for given log-linear model parameters (slope and intercept)
make_long          converts input data to long format (specialized cases)
name_overlaps      outputs the column-name intersection and set differences of two data frames
find_linear_index  finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b        # the model is linear in log10 space
  total_cases<-10^total_cases.log  # back-transform to predicted case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
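The back-transform this function relies on can be checked with toy numbers (the slope and intercept below are hypothetical, not fitted values): if log10(cases) = m * days + b, the predicted case count is 10^(m * days + b).

```r
# Hypothetical log-linear parameters (illustration only, not fitted from data)
m <- 0.1   # slope: log10 cases gained per day
b <- 2     # intercept: log10 cases at day 0
days <- 0:3
predicted_cases <- 10^(m * days + b)
predicted_cases
# day 0 gives 10^2 = 100 cases; each day multiplies the count by 10^0.1 (~1.26x)
```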

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as the variable/value. Also enforces standardized data-structure naming via fixed variable name, value name, and id.vars.
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------
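A minimal sketch of what make_long does, on a throwaway frame with made-up column names in the JHU wide style (uses reshape2, which is loaded above):

```r
library(reshape2)  # for melt()

# Toy wide-format frame: one column per date, as in the JHU files
wide <- data.frame(Country.Region = c("A", "B"),
                   X1.22.20 = c(1, 0),
                   X1.23.20 = c(2, 1))
long <- melt(wide, id.vars = "Country.Region",
             variable.name = "Date", value.name = "Total_confirmed_cases")
# long has 4 rows: one per (country, date) pair
```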

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
i<-intersect(names(df1),
names(df2))
sd1<-setdiff(names(df1),
names(df2))
sd2<-setdiff(names(df2),names(df1))
cat("intersection:\n",paste(i,"\n"))
cat("in df1 but not df2:\n",paste(sd1,"\n"))
cat("in df2 but not df1:\n",paste(sd2,"\n"))
return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}
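A quick illustration of the intersect/setdiff logic on two throwaway frames:

```r
df1 <- data.frame(a = 1, b = 2)
df2 <- data.frame(b = 3, c = 4)
intersect(names(df1), names(df2))  # "b"  (shared column)
setdiff(names(df1), names(df2))    # "a"  (in df1 but not df2)
setdiff(names(df2), names(df1))    # "c"  (in df2 but not df1)
```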

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many of the last `running_avg` second-derivative values are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-running_avg+1),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Find the date `running_avg` days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
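The stability criterion above can be sanity-checked on a synthetic series: a perfectly exponential curve is linear in log2 space, so its discrete second derivative is ~0 at every point (the series and threshold here are illustrative):

```r
# Synthetic case series that doubles every day
cases <- 100 * 2^(0:10)
lg <- log2(cases)
d1 <- diff(lg)       # first derivative: constant 1 (one doubling per day)
d2 <- diff(d1)       # second derivative: essentially zero everywhere
all(abs(d2) < 0.2)   # the same threshold used in find_linear_index
# TRUE
```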

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X5.21.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X5.21.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X5.21.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X5.21.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US deaths file but not in the other datasets. When combining, all datasets will have this column, but it is only meaningful for US deaths.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # world-level data lacks City; add it so columns match the US-level data
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU uses an awkward file format: wide, with one column per date in MM/DD/YY. Read this in as raw data, then transform it to be better suited for analysis.
# Furthermore, the World and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "") # read.csv prefixes "X" to column names that start with a digit
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y") # e.g. "1.22.20" -> 2020-01-22
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
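As a toy check of the column-name cleanup (read.csv prefixes "X" to names that start with a digit): the variant below simply strips the X, which also handles two-digit months, and is a sketch rather than the exact pipeline above.

```r
d <- c("X1.22.20", "X12.1.20")  # column names as read.csv produces them
d <- gsub(d, pattern = "^X", replacement = "")
dates <- as.Date(d, format = "%m.%d.%y")
dates
# "2020-01-22" "2020-12-01"
```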

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 121 121
Albania 121 121
Algeria 121 121
Andorra 121 121
Angola 121 121
Antigua and Barbuda 121 121
Argentina 121 121
Armenia 121 121
Australia 968 968
Austria 121 121
Azerbaijan 121 121
Bahamas 121 121
Bahrain 121 121
Bangladesh 121 121
Barbados 121 121
Belarus 121 121
Belgium 121 121
Belize 121 121
Benin 121 121
Bhutan 121 121
Bolivia 121 121
Bosnia and Herzegovina 121 121
Botswana 121 121
Brazil 121 121
Brunei 121 121
Bulgaria 121 121
Burkina Faso 121 121
Burma 121 121
Burundi 121 121
Cabo Verde 121 121
Cambodia 121 121
Cameroon 121 121
Canada 1694 1694
Central African Republic 121 121
Chad 121 121
Chile 121 121
China 3993 3993
Colombia 121 121
Comoros 121 121
Congo (Brazzaville) 121 121
Congo (Kinshasa) 121 121
Costa Rica 121 121
Cote d’Ivoire 121 121
Croatia 121 121
Cuba 121 121
Cyprus 121 121
Czechia 121 121
Denmark 363 363
Diamond Princess 121 121
Djibouti 121 121
Dominica 121 121
Dominican Republic 121 121
Ecuador 121 121
Egypt 121 121
El Salvador 121 121
Equatorial Guinea 121 121
Eritrea 121 121
Estonia 121 121
Eswatini 121 121
Ethiopia 121 121
Fiji 121 121
Finland 121 121
France 1331 1331
Gabon 121 121
Gambia 121 121
Georgia 121 121
Germany 121 121
Ghana 121 121
Greece 121 121
Grenada 121 121
Guatemala 121 121
Guinea 121 121
Guinea-Bissau 121 121
Guyana 121 121
Haiti 121 121
Holy See 121 121
Honduras 121 121
Hungary 121 121
Iceland 121 121
India 121 121
Indonesia 121 121
Iran 121 121
Iraq 121 121
Ireland 121 121
Israel 121 121
Italy 121 121
Jamaica 121 121
Japan 121 121
Jordan 121 121
Kazakhstan 121 121
Kenya 121 121
Korea, South 121 121
Kosovo 121 121
Kuwait 121 121
Kyrgyzstan 121 121
Laos 121 121
Latvia 121 121
Lebanon 121 121
Lesotho 121 121
Liberia 121 121
Libya 121 121
Liechtenstein 121 121
Lithuania 121 121
Luxembourg 121 121
Madagascar 121 121
Malawi 121 121
Malaysia 121 121
Maldives 121 121
Mali 121 121
Malta 121 121
Mauritania 121 121
Mauritius 121 121
Mexico 121 121
Moldova 121 121
Monaco 121 121
Mongolia 121 121
Montenegro 121 121
Morocco 121 121
Mozambique 121 121
MS Zaandam 121 121
Namibia 121 121
Nepal 121 121
Netherlands 605 605
New Zealand 121 121
Nicaragua 121 121
Niger 121 121
Nigeria 121 121
North Macedonia 121 121
Norway 121 121
Oman 121 121
Pakistan 121 121
Panama 121 121
Papua New Guinea 121 121
Paraguay 121 121
Peru 121 121
Philippines 121 121
Poland 121 121
Portugal 121 121
Qatar 121 121
Romania 121 121
Russia 121 121
Rwanda 121 121
Saint Kitts and Nevis 121 121
Saint Lucia 121 121
Saint Vincent and the Grenadines 121 121
San Marino 121 121
Sao Tome and Principe 121 121
Saudi Arabia 121 121
Senegal 121 121
Serbia 121 121
Seychelles 121 121
Sierra Leone 121 121
Singapore 121 121
Slovakia 121 121
Slovenia 121 121
Somalia 121 121
South Africa 121 121
South Sudan 121 121
Spain 121 121
Sri Lanka 121 121
Sudan 121 121
Suriname 121 121
Sweden 121 121
Switzerland 121 121
Syria 121 121
Taiwan* 121 121
Tajikistan 121 121
Tanzania 121 121
Thailand 121 121
Timor-Leste 121 121
Togo 121 121
Trinidad and Tobago 121 121
Tunisia 121 121
Turkey 121 121
Uganda 121 121
Ukraine 121 121
United Arab Emirates 121 121
United Kingdom 1331 1331
Uruguay 121 121
US 121 121
US_state 394581 394581
Uzbekistan 121 121
Venezuela 121 121
Vietnam 121 121
West Bank and Gaza 121 121
Western Sahara 121 121
Yemen 121 121
Zambia 121 121
Zimbabwe 121 121
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some counties are not summarized on the country level. collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

# Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case occurred for each Country/Province. NOTE: non-US countries may have unusual province structures. For example, France is summarized at the country level but also has 3 provinces. I've only ensured that case100 works for the U.S., so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider a city-level summary as well. This data may be sparse.

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
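The days-since-100 computation can be illustrated on a toy frame with base R (the region and day values are made up):

```r
toy <- data.frame(region = rep(c("A", "B"), each = 3),
                  day    = rep(1:3, 2),
                  cases  = c(50, 150, 400, 20, 80, 120))
# First day on which each region exceeded 100 cases
first100 <- aggregate(day ~ region, data = subset(toy, cases > 100), FUN = min)
toy <- merge(toy, setNames(first100, c("region", "case100_day")))
toy$days_since_100 <- toy$day - toy$case100_day
# Region A crossed 100 on day 2, region B on day 3
```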

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 ) 
kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 3931
Alaska 717
Arizona 1001
Arkansas 4080
California 3820
Colorado 3584
Connecticut 592
Delaware 243
Diamond Princess 66
District of Columbia 67
Florida 4226
Georgia 9366
Grand Princess 67
Guam 67
Hawaii 347
Idaho 1859
Illinois 5298
Indiana 5457
Iowa 4886
Kansas 3995
Kentucky 5883
Louisiana 3955
Maine 1002
Maryland 1553
Massachusetts 1015
Michigan 4749
Minnesota 4419
Mississippi 4900
Missouri 5336
Montana 1670
Nebraska 2974
Nevada 737
New Hampshire 697
New Jersey 1511
New Mexico 1620
New York 3801
North Carolina 5736
North Dakota 1882
Northern Mariana Islands 52
Ohio 5125
Oklahoma 3830
Oregon 1976
Pennsylvania 4065
Puerto Rico 67
Rhode Island 394
South Carolina 2869
South Dakota 2411
Tennessee 5525
Texas 11412
Utah 928
Vermont 930
Virgin Islands 67
Virginia 7202
Washington 2605
West Virginia 2586
Wisconsin 3842
Wyoming 1187
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question                                  | dataset | x     | y              | color     | facet             | pch       | dimensions
comparative longitudinal case trend (log) | long    | time  | log_cases      | geography | none (case type?) | case_type | [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend       | long    | time  | cases          | geography | case_type         | ?         | [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend  | wide    | time  | mortality rate | geography | none              | none      | [15, 50, 4] geography
death vs total correlation                | wide    | cases | deaths         | geography | none              | none      | [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each? -> apply lm for each region starting from the 100th case; m, b associated with each.
    # input: geographical region, log cases vs day (since 100th case)
    # output: m, b for each geographical region ID



#total/death on same plot: they differ by ~2 logs, so when plotting on the log scale, use pch; when plotting absolute counts, use free scales
#when plotting deaths and cases on the same plot, melt.

#CoronaCases -> filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 33597
China 121
Diamond Princess 102
Korea, South 92
Japan 91
Italy 89
Iran 86
Singapore 83
France 82
Germany 82
Spain 81
US 80
Switzerland 78
United Kingdom 78
Belgium 77
Netherlands 77
Norway 77
Sweden 77
Austria 75
Malaysia 74
Australia 73
Bahrain 73
Denmark 73
Canada 72
Qatar 72
Iceland 71
Brazil 70
Czechia 70
Finland 70
Greece 70
Iraq 70
Israel 70
Portugal 70
Slovenia 70
Egypt 69
Estonia 69
India 69
Ireland 69
Kuwait 69
Philippines 69
Poland 69
Romania 69
Saudi Arabia 69
Indonesia 68
Lebanon 68
San Marino 68
Thailand 68
Chile 67
Pakistan 67
Luxembourg 66
Peru 66
Russia 66
Ecuador 65
Mexico 65
Slovakia 65
South Africa 65
United Arab Emirates 65
Armenia 64
Colombia 64
Croatia 64
Panama 64
Serbia 64
Taiwan* 64
Turkey 64
Argentina 63
Bulgaria 63
Latvia 63
Uruguay 63
Algeria 62
Costa Rica 62
Dominican Republic 62
Hungary 62
Andorra 61
Bosnia and Herzegovina 61
Jordan 61
Lithuania 61
Morocco 61
New Zealand 61
North Macedonia 61
Vietnam 61
Albania 60
Cyprus 60
Malta 60
Moldova 60
Brunei 59
Burkina Faso 59
Sri Lanka 59
Tunisia 59
Ukraine 58
Azerbaijan 57
Ghana 57
Kazakhstan 57
Oman 57
Senegal 57
Venezuela 57
Afghanistan 56
Cote d’Ivoire 56
Cuba 55
Mauritius 55
Uzbekistan 55
Cambodia 54
Cameroon 54
Honduras 54
Nigeria 54
West Bank and Gaza 54
Belarus 53
Georgia 53
Bolivia 52
Kosovo 52
Kyrgyzstan 52
Montenegro 52
Congo (Kinshasa) 51
Kenya 50
Niger 49
Guinea 48
Rwanda 48
Trinidad and Tobago 48
Paraguay 47
Bangladesh 46
Djibouti 44
El Salvador 43
Guatemala 42
Madagascar 41
Mali 40
Congo (Brazzaville) 37
Jamaica 37
Gabon 35
Somalia 35
Tanzania 35
Ethiopia 34
Burma 33
Sudan 32
Liberia 31
Maldives 29
Equatorial Guinea 28
Cabo Verde 26
Sierra Leone 24
Guinea-Bissau 23
Togo 23
Zambia 22
Eswatini 21
Chad 20
Tajikistan 19
Haiti 17
Sao Tome and Principe 17
Benin 15
Nepal 15
Uganda 15
Central African Republic 14
South Sudan 14
Guyana 12
Mozambique 11
Yemen 7
Mongolia 6
Mauritania 3
Nicaragua 3
# Pick top 15 countries with data
max_colors<-12
# find a way to fix this: China has different provinces, so the plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 121
Korea, South 92
Japan 91
Italy 89
Iran 86
Singapore 83
France 82
Germany 82
Spain 81
US 80
Switzerland 78
United Kingdom 78
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May"))

measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)
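The commented-out fit above can be sketched on synthetic data (intercept 2 and slope 0.1 are arbitrary toy values, not estimates from the real case data):

```r
set.seed(1)
days <- 0:30
# Simulate log10 cases growing linearly with a little noise
log_cases <- 2 + 0.1 * days + rnorm(length(days), sd = 0.05)
fit <- lm(log_cases ~ days)
coef(fit)             # intercept near 2, slope near 0.1
cor(days, log_cases)  # correlation coefficient, close to 1
```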

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3")

##/////////////////////////
### Plot longitudinal city-level cases (log scale)

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

  1. Find the time at which the case vs. date curve becomes linear in each plot
  2. Fit a linear model for each state
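The `find_linear_index()` helper used below is defined upstream of this section. As an assumption about its behavior (not the author's actual implementation), a minimal stand-in that returns the first numeric date at which the curve looks linear — i.e., the discrete second derivative is near zero relative to the case load — could be:

```r
# Hypothetical sketch of find_linear_index(): scan for the first date where
# the second difference of cases is small relative to the case count.
find_linear_index_sketch <- function(df, tol = 0.05) {
  df  <- df[order(df$Date), ]
  y   <- df$Total_confirmed_cases_perstate
  d2  <- diff(y, differences = 2)           # discrete second derivative
  rel <- abs(d2) / pmax(y[-(1:2)], 1)       # scale by case load
  idx <- which(rel < tol)[1]                # first approximately-linear point
  if (is.na(idx)) idx <- 1
  as.numeric(df$Date[idx + 2])              # numeric date, as the lm loop expects
}
```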

# What is the predict # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 states, total count")
top_states<-top_states_totals$Province.State

# Manually adjust the list: swap New York out and Maryland in
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

# Start from an empty, correctly-typed frame so no all-NA first row survives the rbinds
predicted_days_df<-data.frame(Province.State=character(),
                              days=numeric(),
                              Total_confirmed_cases_perstate=numeric())

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }
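`prediction_model()` is defined upstream of this report. Given that the rbind above expects columns `days` and `Total_confirmed_cases_perstate`, a minimal sketch of its interface (an assumption, not the author's code) is just the fitted line evaluated at the requested numeric days:

```r
# Hypothetical sketch of prediction_model(): evaluate the per-state linear
# fit m * day + b at each requested (numeric) day.
prediction_model_sketch <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}
```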

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")
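The `origin` argument above matters because base-R `Date`s are stored as days since 1970-01-01, so the numeric `days` column round-trips back to a calendar date:

```r
# Numeric dates count days since the Unix epoch, 1970-01-01.
d <- as.numeric(as.Date("2020-04-20"))            # 18372 days since the epoch
stopifnot(as.Date(d, origin = "1970-01-01") == as.Date("2020-04-20"))
```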

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What does the model fit thus far predict for COVID-19 case counts? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long until x number of cases? When will the plateau happen? Are any effects of social distancing (and its delays) noticeable?
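One of these questions has a closed form: under the log-linear model log10(cases) = m * day + b, the day on which a target count X is reached, and the implied doubling time, fall directly out of the fitted slope and intercept (hypothetical helper names, not from the analysis code):

```r
# Days until a target case count X, given a log10-linear fit m * day + b
days_to_target <- function(m, b, X) (log10(X) - b) / m

# Doubling time in days implied by slope m on the log10 scale
doubling_time <- function(m) log10(2) / m
```

For example, a fit with b = 2 (100 cases at day 0) and slope m = 0.1 reaches 1,000 cases at day 10.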

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
ddply(tmp,c("state"))
lm(data =tmp,formula = y~x )

train_lm<-function(input_data,subset_column,formula_input){
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # ldply returns the grouping column, then "(Intercept)", then the slope;
  # rename the coefficient columns to b (intercept) and m (slope)
  names(case_models.parameters)[2:3] <- c("b","m")
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for: * Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))
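The `gsub()`/`as.numeric()` step above is the percent-string cleanup; in isolation it behaves like:

```r
# Google mobility values arrive as strings like "-45%"; strip the percent
# sign, then coerce to numeric (NA values pass through as NA).
pct <- c("-45%", "14%", NA)
num <- as.numeric(gsub(pct, pattern = "%", replacement = ""))
```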

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases per 100 people(Today)"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit -1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Transit 1.0000000 -37.0
Delaware Parks -1.0000000 20.5
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Retail_Recreation 0.9922387 -56.0
Hawaii Grocery_Pharmacy 0.9675632 -34.0
New Hampshire Parks 0.9582767 -20.0
Connecticut Grocery_Pharmacy -0.9029214 -6.0
Maine Transit -0.8993371 -50.0
Alaska Residential 0.8865220 13.0
South Dakota Parks 0.8830750 -26.0
Vermont Parks 0.8519884 -35.5
Utah Residential -0.8380201 12.0
Alaska Grocery_Pharmacy -0.8018082 -7.0
Hawaii Residential -0.7903929 19.0
Massachusetts Workplace -0.7751019 -39.0
Utah Transit -0.7739098 -18.0
Rhode Island Workplace -0.7446594 -39.5
Connecticut Transit -0.7439114 -50.0
Alaska Workplace -0.7307515 -34.0
Wyoming Transit -0.7299510 -17.0
Utah Parks -0.7109591 17.0
Wyoming Parks -0.6822642 -4.0
Hawaii Parks 0.6754971 -72.0
Utah Workplace -0.6701793 -37.0
Vermont Grocery_Pharmacy -0.6611492 -25.0
Maine Workplace -0.6507652 -30.0
New York Workplace -0.6436606 -34.5
Rhode Island Retail_Recreation -0.6275327 -45.0
Montana Workplace -0.6239388 -40.5
New Jersey Workplace -0.6235008 -44.0
Rhode Island Residential -0.6215121 18.5
Arizona Grocery_Pharmacy -0.6178463 -15.0
Hawaii Transit 0.6126003 -89.0
Nebraska Workplace 0.6040628 -32.5
New Jersey Parks -0.5899079 -6.0
New York Retail_Recreation -0.5861270 -46.0
North Dakota Retail_Recreation -0.5760095 -42.0
Hawaii Workplace 0.5463305 -46.0
Massachusetts Retail_Recreation -0.5359455 -44.0
Connecticut Residential 0.5309730 14.0
New York Parks 0.5255410 20.0
New Jersey Retail_Recreation -0.5220336 -62.5
Connecticut Retail_Recreation -0.5044319 -45.0
West Virginia Parks 0.5008009 -33.0
Arizona Retail_Recreation -0.5005883 -42.5
Maine Parks 0.4917020 -31.0
Montana Parks -0.4913929 -58.0
Connecticut Workplace -0.4911561 -39.0
Nebraska Residential -0.4878737 14.0
Wyoming Workplace -0.4844452 -31.0
Iowa Parks -0.4843777 28.5
Arkansas Parks -0.4840191 -12.0
New Jersey Grocery_Pharmacy -0.4821466 2.5
New Mexico Grocery_Pharmacy -0.4806054 -11.0
Rhode Island Parks 0.4805924 52.0
Montana Residential 0.4701424 14.0
Idaho Workplace -0.4604719 -29.0
New Mexico Parks 0.4597290 -31.5
Illinois Transit -0.4544202 -31.0
North Dakota Parks 0.4517393 -34.0
Wisconsin Transit -0.4511274 -23.5
New Mexico Residential 0.4480622 13.5
Massachusetts Grocery_Pharmacy -0.4360026 -7.0
Idaho Transit -0.4346598 -30.0
Kentucky Parks -0.4328704 28.5
Pennsylvania Workplace -0.4316174 -36.0
Vermont Residential 0.4312246 11.5
California Transit -0.4311311 -42.0
California Residential 0.4273991 14.0
New Jersey Transit -0.4225150 -50.5
Kansas Parks 0.4185694 72.0
Montana Retail_Recreation -0.4145442 -51.0
South Carolina Workplace 0.4140267 -30.0
New Hampshire Residential -0.4098081 14.0
Idaho Grocery_Pharmacy -0.3983169 -4.5
Alabama Workplace -0.3972839 -29.0
Maryland Workplace -0.3911901 -35.0
Montana Transit -0.3892824 -41.0
Maryland Grocery_Pharmacy -0.3836165 -10.0
Arizona Residential 0.3832570 13.0
New York Transit -0.3746120 -48.0
Florida Residential 0.3691193 14.0
Alabama Transit -0.3624425 -36.5
Alabama Grocery_Pharmacy -0.3602373 -2.0
New Mexico Retail_Recreation -0.3584947 -42.5
Wyoming Grocery_Pharmacy -0.3567237 -10.0
Pennsylvania Retail_Recreation -0.3552358 -45.0
California Parks -0.3530093 -38.5
Arizona Transit 0.3496881 -38.0
Nevada Transit -0.3459208 -20.0
Pennsylvania Parks 0.3289523 13.0
Michigan Parks 0.3284272 30.0
Montana Grocery_Pharmacy -0.3279939 -16.0
Idaho Retail_Recreation -0.3267680 -40.5
Nebraska Grocery_Pharmacy 0.3218675 -0.5
Minnesota Transit -0.3159510 -28.5
Alaska Retail_Recreation 0.3115519 -39.0
North Carolina Grocery_Pharmacy 0.3030644 0.0
West Virginia Grocery_Pharmacy -0.3020151 -6.0
North Dakota Workplace 0.2994541 -40.0
Utah Retail_Recreation -0.2966547 -40.0
Colorado Residential 0.2956514 14.0
Kansas Workplace 0.2941208 -33.0
Florida Parks -0.2932791 -43.0
Vermont Retail_Recreation 0.2924439 -57.0
Maine Retail_Recreation -0.2889205 -42.0
California Retail_Recreation -0.2837217 -44.0
Rhode Island Transit -0.2833133 -56.0
Texas Workplace 0.2803509 -32.0
Texas Residential -0.2792701 15.0
Arkansas Retail_Recreation -0.2767656 -30.0
California Grocery_Pharmacy -0.2724862 -12.0
Maryland Retail_Recreation -0.2688604 -39.0
Oregon Grocery_Pharmacy 0.2687108 -7.0
Nevada Residential 0.2667587 17.0
Mississippi Residential 0.2657378 13.0
Maryland Residential 0.2630173 15.0
Nevada Retail_Recreation -0.2629780 -43.0
Virginia Transit -0.2579386 -33.0
Rhode Island Grocery_Pharmacy 0.2575047 -7.5
Illinois Workplace -0.2548683 -31.0
Texas Parks 0.2530252 -42.0
California Workplace -0.2527114 -36.0
Tennessee Workplace -0.2517073 -31.0
North Carolina Workplace 0.2508282 -31.0
Wisconsin Parks 0.2491366 51.5
Georgia Grocery_Pharmacy -0.2483913 -10.0
Tennessee Residential 0.2429816 11.5
Pennsylvania Grocery_Pharmacy -0.2361065 -6.0
New York Grocery_Pharmacy -0.2324447 8.0
North Carolina Transit 0.2311491 -32.0
Arkansas Residential 0.2290277 12.0
Illinois Parks 0.2281766 26.5
Vermont Workplace -0.2225964 -43.0
Michigan Workplace -0.2225856 -40.0
Washington Workplace -0.2204407 -38.0
South Carolina Parks -0.2189900 -23.0
North Carolina Residential 0.2155947 13.0
New Jersey Residential 0.2116381 18.0
Oregon Residential 0.2078940 10.5
Iowa Transit 0.2045321 -24.0
Missouri Residential -0.2001303 13.0
Kansas Grocery_Pharmacy -0.1988602 -14.0
Alabama Parks 0.1950833 -1.0
South Dakota Transit -0.1945599 -40.0
Mississippi Grocery_Pharmacy -0.1915860 -8.0
Missouri Workplace 0.1913223 -28.5
Georgia Workplace -0.1900152 -33.5
Wyoming Retail_Recreation -0.1886983 -39.0
North Dakota Grocery_Pharmacy -0.1876618 -8.0
Illinois Residential 0.1841242 14.0
Colorado Parks -0.1791193 2.0
Texas Transit 0.1715703 -41.5
Georgia Residential -0.1711913 13.0
Virginia Grocery_Pharmacy -0.1707897 -8.0
Oklahoma Residential 0.1696894 15.0
Oklahoma Parks -0.1690216 -18.5
Virginia Parks 0.1682274 6.0
New Mexico Transit 0.1682172 -38.5
Georgia Retail_Recreation -0.1665717 -41.0
Ohio Transit 0.1657983 -28.0
Virginia Residential 0.1620822 14.0
Wisconsin Residential -0.1600862 14.0
Massachusetts Transit -0.1550985 -45.0
Washington Transit -0.1528533 -33.5
Maine Residential -0.1523133 11.0
Nebraska Parks 0.1510679 55.5
Michigan Retail_Recreation -0.1465955 -53.0
South Carolina Residential -0.1464623 12.0
Washington Residential 0.1461195 13.0
South Dakota Retail_Recreation -0.1455905 -38.5
North Dakota Transit 0.1451393 -48.0
New Hampshire Retail_Recreation -0.1440884 -41.0
Florida Workplace -0.1419087 -33.0
Indiana Residential 0.1402456 12.0
Idaho Residential -0.1389515 11.0
Indiana Retail_Recreation 0.1365904 -38.0
Florida Retail_Recreation 0.1347963 -43.0
Oregon Retail_Recreation 0.1330007 -41.0
Pennsylvania Transit -0.1318009 -41.5
Massachusetts Residential 0.1310585 15.0
Massachusetts Parks 0.1302336 39.0
Maine Grocery_Pharmacy -0.1302041 -13.0
Connecticut Parks 0.1297954 43.0
Mississippi Transit -0.1297069 -38.5
Ohio Parks -0.1292685 67.5
West Virginia Workplace 0.1249417 -33.0
Idaho Parks 0.1233609 -22.0
Kansas Transit -0.1231042 -26.5
Wisconsin Workplace -0.1230451 -31.0
Alabama Retail_Recreation 0.1223139 -39.0
Minnesota Workplace -0.1218446 -33.0
North Carolina Parks -0.1216284 7.0
Oregon Parks 0.1198795 16.5
New Hampshire Grocery_Pharmacy -0.1196930 -6.0
Kentucky Transit 0.1168442 -31.0
Washington Grocery_Pharmacy 0.1155905 -7.0
Mississippi Retail_Recreation -0.1148441 -40.0
Minnesota Parks 0.1126830 -9.0
Arkansas Transit 0.1118509 -27.0
Indiana Parks -0.1110087 29.0
Maryland Transit -0.1104801 -39.0
Arkansas Workplace -0.1101179 -26.0
West Virginia Residential -0.1095870 11.0
Mississippi Workplace -0.1091582 -33.0
South Dakota Residential 0.1082829 15.0
Ohio Residential 0.1070232 14.0
New Hampshire Transit -0.1064143 -57.0
Arizona Workplace -0.1022458 -35.0
Michigan Grocery_Pharmacy -0.1016359 -11.0
Kentucky Grocery_Pharmacy 0.1013980 4.0
Nebraska Retail_Recreation 0.0985741 -36.0
Pennsylvania Residential 0.0964245 15.0
Wisconsin Grocery_Pharmacy 0.0948362 -1.0
New York Residential 0.0913429 17.5
Minnesota Retail_Recreation 0.0912603 -40.0
Georgia Parks 0.0896830 -6.0
Texas Grocery_Pharmacy 0.0894631 -14.0
South Dakota Grocery_Pharmacy 0.0882891 -9.0
Virginia Workplace -0.0879471 -31.5
Washington Parks 0.0878860 -3.5
Missouri Transit -0.0868956 -24.5
Wyoming Residential 0.0861765 12.5
Oklahoma Grocery_Pharmacy -0.0855358 -1.0
Indiana Workplace 0.0854548 -34.0
Oregon Workplace -0.0773499 -31.0
South Carolina Transit 0.0770970 -45.0
Indiana Grocery_Pharmacy -0.0764024 -5.5
Michigan Residential 0.0754230 15.0
Virginia Retail_Recreation -0.0750237 -35.0
Kentucky Retail_Recreation 0.0726880 -29.0
Ohio Grocery_Pharmacy 0.0684008 0.0
Ohio Retail_Recreation 0.0666071 -36.0
Colorado Transit 0.0665124 -36.0
Iowa Retail_Recreation -0.0659564 -38.0
Tennessee Parks -0.0608996 10.5
Michigan Transit 0.0592398 -46.0
Florida Transit -0.0580431 -49.0
Nevada Workplace 0.0577436 -40.0
Minnesota Grocery_Pharmacy 0.0568082 -6.5
Iowa Workplace -0.0567880 -30.0
Colorado Retail_Recreation -0.0526941 -44.0
Colorado Grocery_Pharmacy -0.0520563 -17.0
Kentucky Residential 0.0518953 12.0
Oklahoma Workplace 0.0516081 -31.0
Tennessee Transit -0.0513821 -32.0
South Carolina Retail_Recreation -0.0508965 -35.0
Washington Retail_Recreation -0.0492306 -42.0
Oregon Transit 0.0488055 -27.5
South Carolina Grocery_Pharmacy 0.0484048 1.0
North Carolina Retail_Recreation 0.0483940 -34.0
Texas Retail_Recreation 0.0472376 -40.0
Illinois Retail_Recreation 0.0455361 -40.0
Missouri Retail_Recreation -0.0440502 -36.0
New Hampshire Workplace 0.0439213 -37.0
Kentucky Workplace -0.0432285 -36.0
Nebraska Transit -0.0424738 -9.0
Nevada Parks 0.0424460 -12.5
Missouri Parks 0.0417652 0.0
Missouri Grocery_Pharmacy 0.0416519 2.0
South Dakota Workplace 0.0401614 -35.0
Vermont Transit 0.0363188 -63.0
Illinois Grocery_Pharmacy -0.0354806 2.0
Arizona Parks -0.0328933 -44.5
West Virginia Retail_Recreation 0.0322078 -38.5
Oklahoma Retail_Recreation 0.0301780 -31.0
Florida Grocery_Pharmacy 0.0293090 -14.0
Tennessee Grocery_Pharmacy 0.0263813 6.0
Indiana Transit 0.0231145 -29.0
Ohio Workplace -0.0218207 -35.0
Kansas Residential -0.0192335 13.0
Maryland Parks -0.0148355 27.0
Colorado Workplace -0.0148225 -39.0
Iowa Grocery_Pharmacy 0.0146453 4.0
Kansas Retail_Recreation -0.0142039 -38.0
Nevada Grocery_Pharmacy 0.0138284 -12.5
Georgia Transit -0.0130591 -35.0
West Virginia Transit -0.0104711 -45.0
North Dakota Residential -0.0100749 17.0
Wisconsin Retail_Recreation 0.0061984 -44.0
Iowa Residential -0.0049720 13.0
New Mexico Workplace 0.0045458 -34.0
Alabama Residential 0.0042165 11.0
Oklahoma Transit 0.0040708 -26.0
Arkansas Grocery_Pharmacy 0.0036017 3.0
Tennessee Retail_Recreation -0.0031976 -30.0
Minnesota Residential 0.0027019 17.0
Mississippi Parks -0.0015905 -25.0
Utah Grocery_Pharmacy 0.0007774 -4.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
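A caveat for the top of this ranking: a Pearson correlation computed from only two observations is always exactly ±1, so the |cor| = 1.0 rows (e.g., Delaware, Alaska Transit) likely reflect states with just two dated mobility reports rather than a real relationship:

```r
# Any two distinct points lie exactly on a line, so cor() returns +/-1
# regardless of the values involved.
two_point_cor <- cor(x = c(1, 2), y = c(10, 3))
stopifnot(two_point_cor == -1)
```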
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to github. These are provided as a convenience, but note this is a manual process; the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
name | handle | link
World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in a log-linear phase for the world or the U.S. (though some regions, like MD, are still in the log-linear phase). The mortality rate (deaths/confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).

See table below for detailed breakdown.

| Question | Answer |
|----------|--------|
| What is the effect of social distancing / decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California appears to have one of the best correlations, but the picture is mixed. |
| What is the trend in cases and mortality across geographical regions? | Total confirmed cases and mortality are overall log-linear for most countries, with a trailing off beginning for most (including the U.S.). On the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region, but is about 3% overall. |

END

End: ##—— Fri May 22 11:30:20 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
install.packages("maps")  # run once if not already installed
library(maps)
library(ggplot2)
library(dplyr)
# Pennsylvania county outlines (point coordinates, one row per polygon vertex)
pa_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(pa_counties)

ggplot(pa_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
pa_counties$cases <- seq_len(nrow(pa_counties))  # placeholder case counts
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map, c("region", "group")))

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as label coordinate for country's names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date == "2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    #geom_sf(data = Corona_Cases.US_state) +  # errors: not an sf object
    geom_sf(data = counties, aes(fill = area)) +
    #scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

dplyr::inner_join(Corona_Cases.US_state, counties)  # requires a shared key column

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp object to sf before applying st_* operations

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net